Add managed-memory advise, prefetch, and discard-prefetch free functions #1775
rparolin wants to merge 19 commits into NVIDIA:main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
question: Does making these member functions of the `Buffer` class make sense?
I'm moving this back into draft. We discussed it in our team meeting because I was already hesitant: Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives. Free functions sound like a good alternative to explore.
…ns in the cuda.core.managed_memory namespace
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag, discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
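The error-ordering fix can be illustrated with a small sketch: range validation (which raises `ValueError` for non-managed memory) runs before the driver-support probe (which raises `RuntimeError`). Helper names are simplified stand-ins for the real Cython helpers, and the driver interaction is stubbed out.

```python
# Sketch of the check ordering: ValueError for a non-managed target wins
# over RuntimeError for missing discard-prefetch support.

def _normalize_managed_target_range(is_managed, ptr, size):
    if not is_managed:
        raise ValueError("target is not managed memory")
    return ptr, size


def _require_discard_prefetch_support(supported):
    if not supported:
        raise RuntimeError("cuMemDiscardAndPrefetchBatchAsync not available")


def discard_prefetch(is_managed, ptr, size, supported):
    # Range validation runs first, so a non-managed buffer raises
    # ValueError even when the driver lacks discard-prefetch support.
    ptr, size = _normalize_managed_target_range(is_managed, ptr, size)
    _require_discard_prefetch_support(supported)
    return ptr, size
```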
…ps module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I'm not against merging this change, but I think we need to revisit the Buffer design and seriously consider alternatives.
The main concern I have is that Buffer being flat leads us to create non-Pythonic free functions, such as the following:
```python
managed_memory.advise(buffer, "set_preferred_location", 3, location_type="host_numa")
```
A more Pythonic interface would put managed-memory-specific features into a subclass; something like the following:
```python
class ManagedBuffer(Buffer):
    preferred_location: Device | Host | NumaNode | None
    read_mostly: bool
    accessed_by: set[Device | Host]

    def prefetch(self, location, *, stream): ...
    def discard_prefetch(self, location, *, stream): ...
```
Usage would be something like:
```python
buf.preferred_location = Device(0)        # set_preferred_location to device
buf.preferred_location = Host()           # set_preferred_location to host
buf.preferred_location = Host(numa_id=3)  # set_preferred_location to NUMA node
buf.preferred_location = None             # unset_preferred_location

buf.read_mostly = True                    # set_read_mostly
buf.read_mostly = False                   # unset_read_mostly

buf.accessed_by.add(Device(0))            # set_accessed_by
buf.accessed_by.discard(Device(0))        # unset_accessed_by

buf.prefetch(Device(0), stream=s)         # prefetch to device
buf.prefetch(Host(), stream=s)            # prefetch to host
```
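For illustration, a property such as `preferred_location` could delegate to the proposed free functions roughly as follows. The `advise` function here is a recording stub standing in for `cuda.core.managed_memory.advise`, and every name is hypothetical; this just shows that the property-style surface can be a thin layer over the free functions.

```python
# Sketch: a property setter dispatching to the managed-memory free
# functions. `advise` is a stub that records calls instead of touching
# the driver; all names are hypothetical.

calls = []


def advise(buffer, advice, location=None, **kwargs):
    calls.append((advice, location))


class ManagedBuffer:
    def __init__(self):
        self._preferred = None

    @property
    def preferred_location(self):
        return self._preferred

    @preferred_location.setter
    def preferred_location(self, location):
        if location is None:
            advise(self, "unset_preferred_location")
        else:
            advise(self, "set_preferred_location", location)
        self._preferred = location
```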
```cython
cdef struct _MemAttrs:
    int device_id
    bint is_device_accessible
    bint is_host_accessible
    bint is_managed
```

```cython
cdef class Buffer:
    cdef:
        DevicePtrHandle _h_ptr
        size_t _size
        MemoryResource _memory_resource
        object _ipc_data
        object _owner
        _MemAttrs _mem_attrs
        bint _mem_attrs_inited
        object __weakref__
```
Not directly related to your change, but I think Buffer is too complicated. We should revisit the design.
```cython
Device()
ret = cydriver.cuPointerGetAttributes(3, attrs, <void**>vals, ptr)
HANDLE_RETURN(ret)
```
Also unrelated: I don't think we auto-init CUDA anywhere else and I don't think the code should be this defensive.
nit: we should refactor memory tests to a subdirectory: tests/memory/test_managed_ops.py and siblings.
@leofang Can you be a tie breaker here? Do you feel that these APIs should have an object-oriented style and live on a `Buffer`? cc: @Andy-Jost
Summary
Add managed-memory `advise()`, `prefetch()`, and `discard_prefetch()` as free functions under the new `cuda.core.managed_memory` namespace, wrapping the CUDA driver APIs `cuMemAdvise`, `cuMemPrefetchAsync`, and `cuMemDiscardAndPrefetchBatchAsync`.

Closes #1332
Details
New public API — `cuda.core.managed_memory` module with three functions:

- `advise(target, advice, location, *, size, location_type)` — apply managed-memory advice to a range
- `prefetch(target, location, *, stream, size, location_type)` — prefetch a range to a target location
- `discard_prefetch(target, location, *, stream, size, location_type)` — discard and prefetch a range

Each function accepts either a
`Buffer` (size inferred) or a raw pointer (requires `size=`). Location can be specified as a `Device`, int ordinal, `-1` for host, or with an explicit `location_type` (`"device"`, `"host"`, `"host_numa"`, `"host_numa_current"`). Advice can be a `CUmem_advise` enum value or a string alias like `"set_read_mostly"`. The `stream` parameter on `prefetch` and `discard_prefetch` also accepts a `GraphBuilder`.

Location validation matches the CUDA driver spec:

- `set_read_mostly`, `unset_read_mostly`, `unset_preferred_location` — location is optional; allowed types are `device`, `host`, `host_numa`
- `set_preferred_location` — all four location types valid
- `set_accessed_by`, `unset_accessed_by` — only `device` and `host` (rejects `host_numa` and `host_numa_current`)

Backward compatibility — when
`cuda.bindings < 13.0`, the functions fall back to the legacy `cuMemAdvise(ptr, size, advice, device_int)` / `cuMemPrefetchAsync(ptr, size, device_int, stream)` signatures. Enum lookups for the legacy path are cached to avoid repeated `hasattr`/`getattr` calls.

Implementation notes:

- New `_managed_memory_ops.pyx` module under `cuda.core._memory`
- `_buffer.pxd` exposes `_init_mem_attrs`, `_query_memory_attrs`, and the `_MemAttrs` struct (with a new `is_managed` field) for use by the ops module
- `_normalize_managed_location` handles all location inference and constraint checking; each branch returns directly with no dead fallthrough code
- Pointer attributes are queried via `cuPointerGetAttributes` (the existing `_MemAttrs` infrastructure)
- The `cuda.core.managed_memory` module re-exports the three functions from the Cython implementation
- Also available as `cuda.core.experimental.managed_memory`

Tests
Adds coverage for:

- `advise`/`prefetch`/`discard_prefetch` on managed-memory pool buffers and externally wrapped managed allocations
- `advise` with `CUmem_advise` enum values (not just string aliases)
- location-type validation: all four types for `set_preferred_location`; `host_numa`/`host_numa_current` rejection for `set_accessed_by`
- legacy integer locations (`-1` → host, `0` → device)
- `prefetch` with `location=None` raises `ValueError`
- `size=` rejection when target is a `Buffer` (`TypeError`)
- the legacy-signature fallback path (monkeypatched `get_binding_version`)
- raw pointer ranges with explicit `size=`
- range-attribute verification via `cuMemRangeGetAttribute`
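The legacy-signature fallback exercised by these tests can be illustrated with a small dispatch sketch. The driver calls are stubbed and only the version gate is shown; the function bodies and the way the location collapses to a device ordinal are assumptions for illustration, not the real internals.

```python
# Sketch of the version-gated dispatch: cuda.bindings >= 13.0 takes the
# new location-based signature, older bindings take the legacy
# (ptr, size, advice, device_int) form. Driver calls are stubbed.

def _legacy_advise(ptr, size, advice, device_int):
    return ("legacy", ptr, size, advice, device_int)


def _v2_advise(ptr, size, advice, location):
    return ("v2", ptr, size, advice, location)


def advise(ptr, size, advice, location, binding_version):
    if binding_version >= (13, 0):
        return _v2_advise(ptr, size, advice, location)
    # Legacy path: collapse the location to a plain device ordinal.
    device_int = location if isinstance(location, int) else -1
    return _legacy_advise(ptr, size, advice, device_int)
```

A test can then pin `binding_version` below `(13, 0)` to force the legacy branch, which is essentially what the monkeypatched `get_binding_version` tests do.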